System and method for restoring data on demand for instant volume restoration

ABSTRACT

A technique is disclosed for restoring data of sparse volumes, in which one or more block pointers within the file system structure are marked as ABSENT, by fetching the appropriate data from an alternate location on demand. Client data access requests to the local storage system initiate a restoration of the data from a backing store as required. A demand generator can also be used to restore the data as a background process by walking through the sparse volume and restoring the data of absent blocks. A pump module is also disclosed to regulate the access of the demand generator. Once all the data has been restored, the volume contains all data locally, and is no longer a sparse volume.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 11/409,626, by Lango et al., titled SYSTEM AND METHOD FOR RESTORING DATA ON DEMAND FOR INSTANT VOLUME RESTORATION, filed on Apr. 24, 2006, which claims the benefit of U.S. Provisional Patent Application Ser. No. 60/674,430, which was filed on Apr. 25, 2005, by Jason Ansel Lango for a SYSTEM AND METHOD FOR RESTORING DATA ON DEMAND FOR INSTANT VOLUME RESTORATION and is hereby incorporated by reference.

This application is a continuation-in-part application of U.S. Pat. No. 7,197,490, entitled SYSTEM AND METHOD FOR LAZY-COPY SUB-VOLUME LOAD BALANCING IN A NETWORK ATTACHED STORAGE POOL, by Robert M. English, issued Mar. 27, 2007, the contents of which are hereby incorporated by reference.

This application is also related to U.S. patent application Ser. No. 11/409,887, entitled SYSTEM AND METHOD FOR SPARSE VOLUMES, by Jason Lango, et al., and U.S. Pat. No. 7,689,609, issued on Mar. 30, 2010, entitled ARCHITECTURE FOR SUPPORT OF SPARSE VOLUMES, by Jason Lango et al., the contents of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention relates to file systems, and more specifically, to a file system that includes volumes having one or more files with absent blocks that can be restored on demand.

BACKGROUND OF THE INVENTION

A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).

Storage of information on the disk array is preferably implemented as one or more storage volumes of physical disks, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data stripes across a given number of physical disks in the RAID group, and the appropriate storing of redundant information (parity) with respect to the striped data. The physical disks of each RAID group may include disks configured to store striped data (i.e., data disks) and disks configured to store parity for the data (i.e., parity disks). The parity may thereafter be retrieved to enable recovery of data lost when a disk fails. The term “RAID” and its various implementations are well-known and disclosed in A Case for Redundant Arrays of Inexpensive Disks (RAID), by D. A. Patterson, G. A. Gibson and R. H. Katz, Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.

The storage operating system of the storage system may implement a high-level module, such as a file system, to logically organize the information stored on the disks as a hierarchical structure of directories, files and blocks. For example, each on-disk file may be implemented as a set of data structures, i.e., disk blocks, configured to store information, such as the actual data for the file. The data blocks may be utilized to store both user data and metadata within the file system. These data blocks are organized within a volume block number (vbn) space. The file system, which controls the use and contents of blocks within the vbn space, organizes the data blocks within the vbn space as a logical volume; each logical volume may be, although is not necessarily, associated with its own file system. The file system typically consists of a contiguous range of vbns from zero to n-1, for a file system of size n blocks.

A known type of file system is a write-anywhere file system that does not over-write data on disks. If a data block is retrieved (read) from disk into a memory of the storage system and “dirtied” (i.e., updated or modified) with new data, the data block is thereafter stored (written) to a new location on disk to optimize write performance. A write-anywhere file system may also opt to maintain a near optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. An example of a write-anywhere file system that is configured to operate on a storage system is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc., Sunnyvale, Calif.

The storage operating system may further implement a storage module, such as a RAID system, that manages the storage and retrieval of the information to and from the disks in accordance with input/output (I/O) operations. The RAID system is also responsible for parity operations in the storage system. Note that the file system only “sees” the data disks within its vbn space; the parity disks are hidden from the file system and, thus, are only visible to the RAID system. The RAID system typically organizes the RAID groups into one large physical disk (i.e., a physical volume), such that the disk blocks are concatenated across all disks of all RAID groups. The logical volume maintained by the file system is then “disposed over” (spread over) the physical volume maintained by the RAID system.

The storage system may be configured to operate according to a client/server model of information delivery to thereby allow many clients to access the directories, files and blocks stored on the system. In this model, the client may comprise an application, such as a database application, executing on a computer that connects to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the file system by issuing file system protocol messages (in the form of packets) to the storage system over the network. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS) and the Network File System (NFS) protocols, the utility of the storage system is enhanced.

When accessing a block of a file in response to servicing a client request, the file system specifies a vbn that is translated at the file system/RAID system boundary into a disk block number (dbn) location on a particular disk (disk, dbn) within a RAID group of the physical volume. It should be noted that a client request is typically directed to a specific file offset, which is then converted by the file system into a file block number (fbn), which represents a block offset into a particular file. For example, if a file system is using 4 KB blocks, fbn 6 of a file represents a block of data starting 24 KB into the file and extending to 28 KB, where fbn 7 begins. The fbn is converted to an appropriate vbn by the file system. Each block in the vbn space and in the dbn space is typically fixed, e.g., 4K bytes (KB), in size; accordingly, there is typically a one-to-one mapping between the information stored on the disks in the dbn space and the information organized by the file system in the vbn space. The (disk, dbn) location specified by the RAID system is further translated by a disk driver system of the storage operating system into a plurality of sectors (e.g., a 4 KB block with a RAID header translates to 8 or 9 disk sectors of 512 or 520 bytes) on the specified disk.
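
By way of a purely illustrative sketch (the constant and helper names below are assumptions for illustration, not part of the described storage operating system), the offset-to-fbn and dbn-to-sector arithmetic described above may be expressed as follows; the sketch ignores the RAID header mentioned in the text and assumes 512-byte sectors:

    #include <stdio.h>
    #include <stdint.h>

    #define FS_BLOCK_SIZE 4096u   /* 4 KB file system blocks */
    #define SECTOR_SIZE    512u   /* common disk sector size */

    /* A client request at byte "offset" falls within this file block number. */
    static uint64_t offset_to_fbn(uint64_t offset)
    {
        return offset / FS_BLOCK_SIZE;
    }

    /* A 4 KB block spans 8 sectors of 512 bytes; the starting sector for a
     * given dbn on its disk is the dbn scaled by that factor (the RAID
     * header, which can add a ninth sector, is ignored here). */
    static uint64_t dbn_to_first_sector(uint64_t dbn)
    {
        return dbn * (FS_BLOCK_SIZE / SECTOR_SIZE);
    }

    int main(void)
    {
        /* The example from the text: a request 24 KB into a file maps to fbn 6. */
        printf("offset 24576 -> fbn %llu\n",
               (unsigned long long)offset_to_fbn(24576));
        printf("dbn 1000 starts at sector %llu\n",
               (unsigned long long)dbn_to_first_sector(1000));
        return 0;
    }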

The requested block is then retrieved from disk and stored in a buffer cache of the memory as part of a buffer tree of the file. The buffer tree is an internal representation of blocks for a file stored in the buffer cache and maintained by the file system. Broadly stated, the buffer tree has an inode at the root (top-level) of the file. An inode is a data structure used to store information, such as metadata, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Each pointer may be embodied as a vbn to facilitate efficiency among the file system and the RAID system when accessing the data on disks.

The RAID system maintains information about the geometry of the underlying physical disks (e.g., the number of blocks in each disk) in raid labels stored on the disks. The RAID system provides the disk geometry information to the file system for use when creating and maintaining the vbn-to-disk,dbn mappings used to perform write allocation operations and to translate vbns to disk locations for read operations. Block allocation data structures, such as an active map, a snapmap, a space map and a summary map, are data structures that describe block usage within the file system, such as the write-anywhere file system. These mapping data structures are independent of the geometry and are used by a write allocator of the file system as existing infrastructure for the logical volume. Examples of the block allocation data structures are described in U.S. Pat. No. 7,454,445, titled WRITE ALLOCATION BASED ON STORAGE SYSTEM MAP AND SNAPSHOT, issued on Nov. 18, 2008, by Blake Lewis et al., which is hereby incorporated by reference.

The write-anywhere file system typically performs write allocation of blocks in a logical volume in response to an event in the file system (e.g., dirtying of the blocks in a file). When write allocating, the file system uses the block allocation data structures to select free blocks within its vbn space to which to write the dirty blocks. The selected blocks are generally in the same positions along the disks for each RAID group (i.e., within a stripe) so as to optimize use of the parity disks. Stripes of positional blocks may vary among other RAID groups to, e.g., allow overlapping of parity update operations. When write allocating, the file system traverses a small portion of each disk (corresponding to a few blocks in depth within each disk) to essentially lay down a plurality of stripes per RAID group. In particular, the file system chooses vbns that are on the same stripe per RAID group during write allocation using the vbn-to-disk,dbn mappings.

During storage system operation, a volume (or other data container, such as a file or directory) may become corrupted due to, e.g., physical damage to the underlying storage devices, software errors in the storage operating system executing on the storage system or an improperly executing application program that modifies data in the volume. In such situations, an administrator may want to ensure that the volume is promptly mounted and exported so that it is accessible to clients as quickly as possible; this requires that the data in the volume (which may be substantial) be recovered as soon as possible. Often, the data in the volume may be recovered by, e.g., reconstructing the data using stored parity information if the storage devices are utilized in a RAID configuration. Here, reconstruction may occur on-the-fly, resulting in virtually no discernable time where the data is not accessible.

In other situations, reconstruction of the data may not be possible. As a result, the administrator has several options, one of which is to initiate a conventional full restore operation invoking a direct copy of the volume from a point-in-time image stored on another storage system. In the general case, all volume data and metadata must be copied, prior to resuming normal operations, as a guarantee of application consistency. The time taken to complete a full copy of the data is often costly in terms of lost opportunity to run business-critical applications. Such “brute force” data copying is generally inefficient, as the time required to transfer substantial amounts of data, e.g., terabytes, may be on the order of days. Similar disadvantages are associated with restoring data from a tape device or other offline data storage. Another option that enables an administrator to rapidly mount and export a volume is to generate a hole-filled volume, wherein the contents of the volume are “holes”. In this context, holes are manifested as entire blocks of zeros or other predefined pointer values stored within the buffer tree structure of a volume. An example of the use of such holes is described in U.S. Pat. No. 7,457,982, entitled WRITABLE READ-ONLY SNAPSHOTS, by Vijayan Rajan, the contents of which are hereby incorporated by reference.

In such a hole-filled environment, the actual data is not retrieved from a backing store until requested by a client. However, a noted disadvantage of such a hole-based technique is that repeated write operations are needed to generate the appropriate number of zero-filled blocks on disk for the volume. That is, the use of holes to implement a data container that requires additional retrieval operations to retrieve data further requires that the entire buffer tree of a file and/or volume be written to disk during creation. The time required to perform the needed write operations may be substantial depending on the size of the volume or file. Thus, creation of a hole-filled volume is oftentimes impractical due to the need for quick data access to a volume.

A storage environment in which there is typically a need to quickly bring back (or restore) a volume involves the use of a near line storage server. As used herein, the term “near line storage server” means a secondary storage system adapted to store data forwarded from one or more primary storage systems, typically for long term archival purposes. The near line storage server may be utilized in such a storage environment to provide a backup of data storage (e.g., a volume) served by each primary storage system. As a result, the near line storage server is typically optimized to perform bulk data restore operations, but suffers reduced performance when serving individual client data access requests. This latter situation may arise where a primary storage system encounters a failure that damages its volume in such a manner that a client must send its data access requests to the server in order to access data in the volume. This situation also forces the clients to reconfigure with appropriate network addresses associated with the near line storage server to enable such data access.

SUMMARY OF THE INVENTION

The present invention overcomes the disadvantages of the prior art by providing a system and method for instantiating a sparse volume within a file system of a storage system that is used to restore data from a secondary storage system (backing store) on demand. As described herein, a sparse volume contains one or more files with at least one data block (i.e., an absent block) that is not stored locally on disk (i.e., on a local volume) coupled to the storage system. By not immediately retrieving the data block (or a block of zeros as in a hole environment), the sparse volume may be generated and exported quickly with minimal write operations required. The missing data of an absent block is stored on the alternate, possibly remote, backing store and is illustratively retrieved using a remote fetch operation. Once the restored volume is activated, the volume may be accessed for any file operations, including new write operations. Received write operations are processed normally by allocating a new block and modifying a block pointer to reference the newly allocated data block. If the block pointer was previously marked as absent, it is simply overwritten, since the remotely stored old data has been superseded; as a result, the storage system no longer needs to retrieve that data remotely.

In the illustrative embodiment, a sparse volume is initialized with volume infrastructure metadata that utilizes special pointers to data stored on the backing store. Specifically, special pointers (ABSENT pointers) are utilized to indicate that the data requires a special retrieval operation. Use of these ABSENT pointers presents a user, such as a client, with the illusion of an “instant” full restore, thereby avoiding the long wait associated with a conventional full restore operation. The data may then be “restored on demand,” which as used herein denotes waiting until a specific request for the data is issued before expending storage system and network resources to acquire the data. Such restoration of data may be accomplished in response to a client issuing a data access request to the storage system, or by a restore module of the system generating a request (“demand”) for the data during, e.g., background processing.

One feature of the present invention is that once the restoration has begun, the sparse volume is available for all operations, including, e.g., accepting new modifications (write operations) directed to the sparse volume. These write operations are written to the sparse volume, and any new pointers that are written overwrite any ABSENT pointers, thereby signifying that, if a read operation is received, the data should be retrieved from the sparse volume and not from the backing store. Thus, if a particular block is marked ABSENT and a write operation is directed to it, the block is no longer marked ABSENT. Any subsequent read operations will return the newly written data and will not require a remote fetch operation.
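
The effect of a write operation on an ABSENT pointer may be sketched as follows; the sentinel value, structure and function names are illustrative assumptions rather than the actual on-disk format:

    #include <stdint.h>

    #define VBN_ABSENT ((uint32_t)-1)     /* illustrative sentinel: block not local */

    struct block_ptr {
        uint32_t vbn;                     /* volume block number, or VBN_ABSENT */
    };

    /* On a write, the write-anywhere file system allocates a new block and
     * overwrites the pointer.  If the pointer was ABSENT, it simply ceases
     * to be absent; subsequent reads are served locally with no remote fetch. */
    static void write_block(struct block_ptr *bp, uint32_t newly_allocated_vbn)
    {
        bp->vbn = newly_allocated_vbn;    /* overwrites VBN_ABSENT if present */
    }

    static int is_absent(const struct block_ptr *bp)
    {
        return bp->vbn == VBN_ABSENT;
    }

    int main(void)
    {
        struct block_ptr bp = { VBN_ABSENT };
        write_block(&bp, 4242);           /* client write to a sparse region */
        return is_absent(&bp);            /* returns 0: no longer absent */
    }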

According to an aspect of the invention, the restore module is embodied as a novel demand generator configured to scan the sparse volume, searching for blocks with ABSENT pointers. Upon locating such a block, the demand generator initiates a remote fetch operation to retrieve the missing data referenced by each ABSENT pointer from the backing store. The retrieved data is then write allocated to populate the sparse volume. Population of the sparse volume with missing data preferably occurs in connection with a multi-phase projected sequence until there are no absent blocks remaining in the file system. Illustratively, these phases include the inode file, directories and files. In alternative embodiments, the phases may be the inode file, special data containers, directories and files. The special data containers may comprise, for example, hidden or file system metadata containers such as special directories. At this time, the sparse volume transitions to a fully restored, detached local volume. The demand generator may also be configured to utilize a special load path that bypasses a buffer cache of the storage system so as not to “pollute” that cache with retrieved data not currently needed by the client. In addition, the demand generator may implement a read-ahead feature to enhance retrieval of data associated with a sequence of remote fetch operations.
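
One pass of such a demand generator over a single indirect block may be sketched as follows; the fetch and write-allocation routines are hypothetical stubs standing in for storage system internals, so only the control flow reflects the description above:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define VBN_ABSENT        ((uint32_t)-1)
    #define PTRS_PER_INDIRECT 1024

    /* Hypothetical stubs for the backing-store fetch and local write
     * allocation, included only so the sketch is self-contained. */
    static int remote_fetch(uint64_t fbn, void *buf)
    {
        (void)fbn;
        memset(buf, 0, 4096);             /* pretend the fetched data arrived */
        return 0;
    }

    static uint32_t write_allocate(const void *buf)
    {
        (void)buf;
        static uint32_t next_vbn = 1000;
        return next_vbn++;                /* pretend a free local vbn was chosen */
    }

    /* Scan one indirect block for ABSENT pointers, fetch the missing data
     * from the backing store, write allocate it locally, and overwrite the
     * pointer so the block is no longer absent.  Returns blocks restored. */
    static int restore_indirect(uint32_t ptrs[PTRS_PER_INDIRECT], uint64_t base_fbn)
    {
        char buf[4096];
        int restored = 0;

        for (size_t i = 0; i < PTRS_PER_INDIRECT; i++) {
            if (ptrs[i] != VBN_ABSENT)
                continue;                          /* data already local */
            if (remote_fetch(base_fbn + i, buf) != 0)
                continue;                          /* retry on a later pass */
            ptrs[i] = write_allocate(buf);         /* populate the sparse volume */
            restored++;
        }
        return restored;
    }

    int main(void)
    {
        uint32_t ptrs[PTRS_PER_INDIRECT];
        for (size_t i = 0; i < PTRS_PER_INDIRECT; i++)
            ptrs[i] = VBN_ABSENT;                  /* an entirely sparse region */
        return restore_indirect(ptrs, 0) == PTRS_PER_INDIRECT ? 0 : 1;
    }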

According to another aspect of the present invention, a pump module of the storage system provides flow control to the demand generator. In the event the number of outstanding demands and requests for data missing from the sparse volume reaches a predetermined threshold, the pump module regulates the demand generator to slow or temporarily pause its generation of demands. The pump module may further implement a priority policy that, e.g., grants precedence to client-issued requests over generated demands for missing data in the event available system resources are limited.
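
The flow control and priority policy of the pump module may be sketched as follows; the threshold value and the field and function names are assumptions for illustration only:

    #include <stdint.h>

    #define MAX_OUTSTANDING 64u           /* hypothetical threshold */

    struct pump {
        unsigned outstanding;             /* fetch operations currently in flight */
        unsigned client_waiting;          /* queued client-issued requests */
    };

    /* The demand generator asks permission before issuing another demand.
     * Background demands are paused when the threshold is reached or when
     * client requests are waiting (the priority policy). */
    int pump_admit_background_demand(const struct pump *p)
    {
        if (p->outstanding >= MAX_OUTSTANDING)
            return 0;                     /* throttle the demand generator */
        if (p->client_waiting > 0)
            return 0;                     /* grant precedence to clients */
        return 1;
    }

    /* Client-issued requests are limited only by the overall threshold. */
    int pump_admit_client_request(const struct pump *p)
    {
        return p->outstanding < MAX_OUTSTANDING;
    }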

Advantageously, a sparse volume of a storage system may be instantiated to quickly restore a local volume that has failed. To that end, the demand generator and pump modules cooperate to permit efficient access to data that is not physically stored on the storage system without requiring transfer of an entire copy of the local volume before serving data access requests. Moreover, the novel modules ensure that all missing data is eventually restored to the sparse volume, bringing it to a fully detached volume state in an efficient manner.

Another advantage of the present invention is that backup operations to the remote backing store may be resumed while a restore operation is ongoing. This enables new client updates to be backed up, permitting a later restoration should a second disaster recovery operation need to be initiated.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:

FIG. 1 is a schematic block diagram of an exemplary network environment in accordance with an embodiment of the present invention;

FIG. 2 is a schematic block diagram of an exemplary storage operating system in accordance with an embodiment of the present invention;

FIG. 3 is a schematic block diagram of an exemplary inode in accordance with an embodiment of the present invention;

FIG. 4 is a schematic block diagram of an exemplary buffer tree in accordance with an embodiment of the present invention;

FIG. 5 is a schematic block diagram of an illustrative embodiment of a buffer tree of a file that may be advantageously used with the present invention;

FIG. 6 is a schematic block diagram of an exemplary aggregate in accordance with an embodiment of the present invention;

FIG. 7 is a schematic block diagram of an exemplary on-disk layout in accordance with an embodiment of the present invention;

FIG. 8 is a schematic block diagram of an exemplary fsinfo block in accordance with an embodiment of the present invention;

FIG. 9 is a flow chart detailing the steps of a procedure for processing a data access request in accordance with an embodiment of the present invention;

FIG. 10 is a flow chart detailing the steps of a procedure for restoring a failed volume in accordance with an embodiment of the present invention;

FIG. 11 is a flow chart detailing the steps of a procedure for operating a demand generator in accordance with an embodiment of the present invention;

FIG. 12 is a flow chart detailing the steps of a projected sequence traversed by a scanner in accordance with an embodiment of the present invention; and

FIG. 13 is a flow chart detailing the steps of a procedure for implementing flow control at a pump module in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

A. Network Environment

FIG. 1 is a schematic block diagram of an environment 100 including a storage system 120 a that may be advantageously used with the present invention. The storage system is a computer that provides storage service relating to the organization of information on storage devices, such as disks 130 of a disk array 160. The storage system 120 a comprises a processor 122, a memory 124, a network adapter 126 and a storage adapter 128 interconnected by a system bus 125. The storage system 120 a also includes a storage operating system 200 that preferably implements a high-level module, such as a file system, to logically organize the information as a hierarchical structure of directories, files and special types of files called virtual disks (hereinafter “blocks”) on the disks.

In the illustrative embodiment, the memory 124 comprises storage locations that are addressable by the processor and adapters for storing software program code. A portion of the memory may be further organized as a buffer cache 170 for storing certain data structures associated with the present invention. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures. Storage operating system 200, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the system 120 a by, inter alia, invoking storage operations executed by the storage system. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used for storing and executing program instructions pertaining to the inventive technique described herein.

The network adapter 126 comprises the mechanical, electrical and signaling circuitry needed to connect the storage system 120 a to a client 110 over a computer network 140, which may comprise a point-to-point connection or a shared medium, such as a local area network (LAN) or wide area network (WAN). Illustratively, the computer network 140 may be embodied as an Ethernet network or a Fibre Channel (FC) network. The client 110 may communicate with the storage system over network 140 by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).

The client 110 may be a general-purpose computer configured to execute applications 112. Moreover, the client 110 may interact with the storage system 120 a in accordance with a client/server model of information delivery. That is, the client may request the services of the storage system, and the system may return the results of the services requested by the client, by exchanging packets 150 over the network 140. The clients may issue packets including file-based access protocols, such as the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol, over TCP/IP when accessing information in the form of files and directories. Alternatively, the client may issue packets including block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over Fibre Channel (FCP), when accessing information in the form of blocks.

The storage adapter 128 cooperates with the storage operating system 200 executing on the system 120 a to access information requested by a user (or client). The information may be stored on any type of attached array of writable storage device media such as video tape, optical, DVD, magnetic tape, bubble memory, electronic random access memory, micro-electro mechanical and any other similar media adapted to store information, including data and parity information. However, as illustratively described herein, the information is preferably stored on the disks 130, such as HDD and/or DASD, of array 160. The storage adapter includes input/output (I/O) interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC serial link topology.

Storage of information on array 160 is preferably implemented as one or more storage “volumes” that comprise a collection of physical storage disks 130 cooperating to define an overall logical arrangement of volume block number (vbn) space on the volume(s). Each logical volume is generally, although not necessarily, associated with its own file system. The disks within a logical volume/file system are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data stripes across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. An illustrative example of a RAID implementation is a RAID-4 level implementation, although it should be understood that other types and levels of RAID implementations may be used in accordance with the inventive principles described herein.

Additionally, a second storage system 120 b is operatively interconnected with the network 140. The second storage system 120 b may be configured as a near line storage server. The storage system 120 b generally comprises hardware similar to storage system 120 a; however, it may alternatively execute a modified storage operating system that adapts the storage system for use as a near line storage server. In alternate embodiments, there may be a plurality of additional storage systems (generally referred to herein as 120) in environment 100.

B. Storage Operating System

To facilitate access to the disks 130, the storage operating system 200 implements a write-anywhere file system that cooperates with virtualization modules to virtualize the storage space provided by disks 130. The file system logically organizes the information as a hierarchical structure of named directories and files on the disks. Each on-disk file may be implemented as a set of disk blocks configured to store information, such as data, whereas the directory may be implemented as a specially formatted file in which names and links to other files and directories are stored. The virtualization modules allow the file system to further logically organize information as a hierarchical structure of blocks on the disks that are exported as named logical unit numbers (luns).

In the illustrative embodiment, the storage operating system is preferably the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., Sunnyvale, Calif., that implements a Write Anywhere File Layout (WAFL™) file system. However, it is expressly contemplated that any appropriate storage operating system may be enhanced for use in accordance with the inventive principles described herein. As such, where the term “WAFL” is employed, it should be taken broadly to refer to any file system that is otherwise adaptable to the teachings of this invention.

FIG. 2 is a schematic block diagram of the storage operating system 200 that may be advantageously used with the present invention. The storage operating system comprises a series of software layers organized to form an integrated network protocol stack or, more generally, a multi-protocol engine that provides data paths for clients to access information stored on the storage system using block and file access protocols. The protocol stack includes a media access layer 210 of network drivers (e.g., gigabit Ethernet drivers) that interfaces to network protocol layers, such as the IP layer 212 and its supporting transport mechanisms, the TCP layer 214 and the User Datagram Protocol (UDP) layer 216. A file system protocol layer provides multi-protocol file access and, to that end, includes support for the Direct Access File System (DAFS) protocol 218, the NFS protocol 220, the CIFS protocol 222 and the Hypertext Transfer Protocol (HTTP) protocol 224. A VI layer 226 implements the VI architecture to provide direct access transport (DAT) capabilities, such as RDMA, as required by the DAFS protocol 218.

An iSCSI driver layer 228 provides block protocol access over the TCP/IP network protocol layers, while a FC driver layer 230 receives and transmits block access requests and responses to and from the storage system. The FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the blocks and, thus, manage exports of luns to either iSCSI or FCP or, alternatively, to both iSCSI and FCP when accessing the blocks on the storage system. In addition, the storage operating system includes a storage module embodied as a RAID system 240 that manages the storage and retrieval of information to and from the volumes/disks in accordance with I/O operations, and a disk driver system 250 that implements a disk access protocol such as, e.g., the SCSI protocol.

The storage operating system 200 further comprises an NRV protocol layer 295 that interfaces with file system 280. The Network Appliance Remote Volume (NRV) protocol is generally utilized for remote fetching of data blocks that are not stored locally on disk. However, as described herein, the NRV protocol may be further utilized in storage system-to-storage system communication to fetch absent blocks in a sparse volume in accordance with the principles of the present invention. It should be noted that, in alternate embodiments, conventional file/block level protocols, such as the NFS protocol, or other proprietary block fetching protocols may be used in place of the NRV protocol within the teachings of the present invention.

In accordance with the present invention, and as described in further detail herein, a demand generator 296 of the storage operating system 200 is used to systematically retrieve data blocks that are not stored locally on disk, i.e., on a local volume of storage system 120 a, while a pump module 298 may be used to regulate the retrieval of those data blocks. Although they are shown and described herein as separate software modules, the demand generator 296 and the pump module 298 may alternatively be integrated within a single module of the operating system. Moreover, it should be noted that the demand generator and the pump module may be implemented as hardware, software, firmware, or any combination thereof.

Bridging the disk software layers with the integrated network protocol stack layers is a virtualization system that is implemented by a file system 280 interacting with virtualization modules illustratively embodied as, e.g., vdisk module 290 and SCSI target module 270. The vdisk module 290 is layered on the file system 280 to enable access by administrative interfaces, such as a user interface (UI) 275, in response to a user (such as a system administrator) issuing commands to the storage system. The SCSI target module 270 is disposed between the FC and iSCSI drivers 228, 230 and the file system 280 to provide a translation layer of the virtualization system between the block (lun) space and the file system space, where luns are represented as blocks. The UI 275 is disposed over the storage operating system in a manner that enables administrative or user access to the various layers and systems.

The file system is illustratively a message-based system that provides logical volume management capabilities for use in access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, the file system 280 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of storage bandwidth of the disks, and (iii) reliability guarantees, such as mirroring and/or parity (RAID). The file system 280 illustratively implements the WAFL file system (hereinafter generally the “write-anywhere file system”) having an on-disk format representation that is block-based using, e.g., 4 kilobyte (KB) blocks and using index nodes (“inodes”) to identify files and file attributes (such as creation time, access permissions, size and block location). The file system uses files to store metadata describing the layout of its file system; these metadata files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk.

Broadly stated, all inodes of the write-anywhere file system are organized into the inode file. A file system (fs) info block specifies the layout of information in the file system and includes an inode of a file that includes all other inodes of the file system. Each logical volume (file system) has an fsinfo block that is preferably stored at a fixed location within, e.g., a RAID group. The inode of the root fsinfo block may directly reference (point to) blocks of the inode file or may reference indirect blocks of the inode file that, in turn, reference direct blocks of the inode file. Within each direct block of the inode file are embedded inodes, each of which may reference indirect blocks that, in turn, reference data blocks of a file.

Operationally, a request from the client 110 is forwarded as a packet 150 over the computer network 140 and onto the storage system 120 a where it is received at the network adapter 126. A network driver (of layer 210 or layer 230) processes the packet and, if appropriate, passes it on to a network protocol and file access layer for additional processing prior to forwarding to the write-anywhere file system 280. Here, the file system generates operations to load (retrieve) the requested data from disk 130 if it is not resident “in core”, i.e., in the buffer cache 170. Illustratively, this operation may be embodied as a Load_Block( ) function 284 of the file system 280. If the information is not in the cache, the file system 280 indexes into the inode file using the inode number to access an appropriate entry and retrieve a logical vbn. The file system then passes a message structure including the logical vbn to the RAID system 240; the logical vbn is mapped to a disk identifier and disk block number (disk,dbn) and sent to an appropriate driver (e.g., SCSI) of the disk driver system 250. The disk driver accesses the dbn from the specified disk 130 and loads the requested data block(s) in buffer cache 170 for processing by the storage system. Upon completion of the request, the storage system (and operating system) returns a reply to the client 110 over the network 140.
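
The control flow of this read path may be compressed into the following sketch; the buffer cache, inode-file lookup and RAID mapping are represented by hypothetical helper declarations, so only the sequence of steps mirrors the description above:

    #include <stdint.h>
    #include <stdbool.h>

    struct disk_loc { unsigned disk_id; uint64_t dbn; };

    /* Hypothetical stand-ins for storage operating system internals. */
    extern bool            buffer_cache_lookup(uint32_t inum, uint64_t fbn, void *buf);
    extern uint64_t        inode_file_lookup(uint32_t inum, uint64_t fbn);
    extern struct disk_loc raid_map_vbn(uint64_t vbn);
    extern void            disk_read(struct disk_loc loc, void *buf);
    extern void            buffer_cache_insert(uint32_t inum, uint64_t fbn, const void *buf);

    /* Read one block of a file: consult the buffer cache first; otherwise
     * index the inode file for the logical vbn, map it to (disk, dbn) and
     * load it from disk into the cache. */
    void load_block(uint32_t inum, uint64_t fbn, void *buf)
    {
        if (buffer_cache_lookup(inum, fbn, buf))
            return;                                    /* already "in core" */

        uint64_t vbn = inode_file_lookup(inum, fbn);   /* inode file -> logical vbn */
        struct disk_loc loc = raid_map_vbn(vbn);       /* vbn -> (disk, dbn) */
        disk_read(loc, buf);                           /* disk driver fetches the dbn */
        buffer_cache_insert(inum, fbn, buf);           /* keep the block for reuse */
    }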

The file system 280 generally provides the Load_Block( ) function 284 to retrieve one or more blocks from disk. These blocks may be retrieved in response to a read request or an exemplary read-ahead algorithm directed to, e.g., a file. As described further herein, if any requested blocks within a buffer tree of the file contain a special ABSENT value (thereby denoting absent blocks), then the Load_Block( ) function 284 initiates a fetch operation to retrieve the absent blocks from an appropriate backing store using the illustrative NRV protocol 295. Once the blocks (including any data blocks) have been retrieved, the Load_Block( ) function 284 returns with the requested data. The NRV protocol is further described in the above-referenced U.S. Patent Application, entitled ARCHITECTURE FOR SUPPORT OF SPARSE VOLUMES, by Jason Lango et al. However, it should be noted that any other suitable file or block based protocol that can retrieve data from a remote backing store, including, e.g., the NFS protocol, can be advantageously used with the present invention. The file system also illustratively includes a Load_Inode( ) function 292 that retrieves inode and file geometry when first accessing a file.

It should be further noted that the software path through the storage operating system layers described above needed to perform data storage access for the client request received at the storage system may alternatively be implemented in hardware. That is, in an alternate embodiment of the invention, a storage access request data path may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the storage service provided by storage system 120 in response to a request issued by client 110. Moreover, in another alternate embodiment of the invention, the processing elements of adapters 126, 128 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 122, to thereby increase the performance of the storage service provided by the system. It is expressly contemplated that the various processes, architectures and procedures described herein can be implemented in hardware, firmware or software.

As used herein, the term “storage operating system” generally refers to the computer-executable code operable to perform a storage function in a storage system, e.g., that manages data access and may, in the case of a file server, implement file system semantics. In this sense, the ONTAP software is an example of such a storage operating system implemented as a microkernel and including the WAFL layer to implement the WAFL file system semantics and manage data access. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.

In addition, it will be understood by those skilled in the art that the inventive system and method described herein may apply to any type of special-purpose (e.g., file server, filer or multi-protocol storage appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system 120. An example of a multi-protocol storage appliance that may be advantageously used with the present invention is described in U.S. patent application Ser. No. 10/215,917 titled MULTI-PROTOCOL STORAGE APPLIANCE THAT PROVIDES INTEGRATED SUPPORT FOR FILE AND BLOCK ACCESS PROTOCOLS, filed on Aug. 8, 2002, now published as U.S. Patent Publication No. 2004/0030668 A1 on Feb. 12, 2004. Moreover, the teachings of this invention can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and disk assembly directly-attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems.

C. File System Organization

In the illustrative embodiment, a file is represented in the write-anywhere file system as an inode data structure adapted for storage on the disks 130. FIG. 3 is a schematic block diagram of an inode 300, which preferably includes a metadata section 310 and a data section 350. The information stored in the metadata section 310 of each inode 300 describes the file and, as such, includes the type (e.g., regular, directory, virtual disk) 312 of file, the size 314 of the file, time stamps (e.g., access and/or modification) 316 for the file and ownership, i.e., user identifier (UID 318) and group ID (GID 320), of the file. The contents of the data section 350 of each inode, however, may be interpreted differently depending upon the type of file (inode) defined within the type field 312. For example, the data section 350 of a directory inode contains metadata controlled by the file system, whereas the data section of a regular inode contains file system data. In this latter case, the data section 350 includes a representation of the data associated with the file.

Specifically, the data section 350 of a regular on-disk inode may include file system data or pointers, the latter referencing 4 KB data blocks on disk used to store the file system data. Each pointer is preferably a logical vbn to facilitate efficiency among the file system and the RAID system 240 when accessing the data on disks. Given the restricted size (e.g., 128 bytes) of the inode, file system data having a size that is less than or equal to 64 bytes is represented, in its entirety, within the data section of that inode. However, if the file system data is greater than 64 bytes but less than or equal to 64 KB, then the data section of the inode (e.g., a first level inode) comprises up to 16 pointers, each of which references a 4 KB block of data on the disk.

Moreover, if the size of the data is greater than 64 KB but less than or equal to 64 megabytes (MB), then each pointer in the data section 350 of the inode (e.g., a second level inode) references an indirect block (e.g., a first level block) that contains up to 1024 pointers, each of which references a 4 KB data block on disk. For file system data having a size greater than 64 MB, each pointer in the data section 350 of the inode (e.g., a third level inode) references a double-indirect block (e.g., a second level block) that contains up to 1024 pointers, each referencing an indirect (e.g., a first level) block. The indirect block, in turn, contains 1024 pointers, each of which references a 4 KB data block on disk. When accessing a file, each block of the file may be loaded from disk 130 into the buffer cache 170.
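
The size thresholds recited above can be restated as a small helper; the constants merely repeat the figures in the text (64-byte inline data, sixteen 4-byte pointers, 1024 pointers per indirect block) and the function name is illustrative:

    #include <stdint.h>

    #define INODE_INLINE_MAX  64ull       /* data stored directly in the inode */
    #define BLOCK_SIZE        4096ull     /* 4 KB file system block */
    #define PTRS_PER_INODE    16ull       /* sixteen 4-byte pointers */
    #define PTRS_PER_INDIRECT 1024ull     /* pointers per indirect block */

    /* Levels of indirection implied by a file's size:
     *   0 = data inline in the inode        (<= 64 bytes)
     *   1 = direct pointers to data blocks  (<= 64 KB)
     *   2 = single indirect blocks          (<= 64 MB)
     *   3 = double indirect blocks          (>  64 MB)  */
    int indirection_levels(uint64_t file_size)
    {
        if (file_size <= INODE_INLINE_MAX)
            return 0;
        if (file_size <= PTRS_PER_INODE * BLOCK_SIZE)                      /* 64 KB */
            return 1;
        if (file_size <= PTRS_PER_INODE * PTRS_PER_INDIRECT * BLOCK_SIZE)  /* 64 MB */
            return 2;
        return 3;
    }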

When an on-disk inode (or block) is loaded from disk 130 into buffer cache 170, its corresponding in core structure embeds the on-disk structure. For example, the dotted line surrounding the inode 300 (FIG. 3) indicates the in core representation of the on-disk inode structure. The in core structure is a block of memory that stores the on-disk structure plus additional information needed to manage data in the memory (but not on disk). The additional information may include, e.g., a dirty bit 360. After data in the inode (or block) is updated/modified as instructed by, e.g., a write operation, the modified data is marked dirty using the dirty bit 360 so that the inode (block) can be subsequently “flushed” (stored) to disk. The in core and on-disk format structures of the WAFL file system, including the inodes and inode file, are disclosed and described in the previously incorporated U.S. Pat. No. 5,819,292 titled METHOD FOR MAINTAINING CONSISTENT STATES OF A FILE SYSTEM AND FOR CREATING USER-ACCESSIBLE READ-ONLY COPIES OF A FILE SYSTEM by David Hitz et al., issued on Oct. 6, 1998.
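
The in core arrangement described above, in which the in core structure embeds the on-disk structure and adds memory-only state such as the dirty bit, may be sketched as follows; the field names are illustrative and do not reproduce the actual WAFL formats:

    #include <stdint.h>
    #include <stdbool.h>

    struct ondisk_inode {                 /* simplified on-disk inode */
        uint32_t type;                    /* regular, directory, virtual disk */
        uint64_t size;
        uint32_t uid, gid;
        uint32_t block_ptrs[16];          /* data section: pointers or inline data */
    };

    struct incore_inode {
        struct ondisk_inode disk;         /* embedded on-disk structure */
        bool dirty;                       /* memory-only: set on modification,
                                             cleared when flushed to disk */
    };

    void mark_dirty(struct incore_inode *ip)   { ip->dirty = true;  }
    void mark_flushed(struct incore_inode *ip) { ip->dirty = false; }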

FIG. 4 is a schematic block diagram of an embodiment of a buffer tree of a file that may be advantageously used with the present invention. The buffer tree is an internal representation of blocks for a file (e.g., file 400) loaded into the buffer cache 170 and maintained by the write-anywhere file system 280. A root (top-level) inode 402, such as an embedded inode, references indirect (e.g., level 1) blocks 404. Note that there may be additional levels of indirect blocks (e.g., level 2, level 3) depending upon the size of the file. The indirect blocks (and inode) contain pointers 405 that ultimately reference data blocks 406 used to store the actual data of the file. That is, the data of file 400 are contained in data blocks and the locations of these blocks are stored in the indirect blocks of the file. Each level 1 indirect block 404 may contain pointers to as many as 1024 data blocks. According to the “write anywhere” nature of the file system, these blocks may be located anywhere on the disks 130.

A file system layout is provided that apportions an underlying physical volume into one or more virtual volumes (vvols) of a storage system. An example of such a file system layout is described in U.S. Pat. No. 7,409,494, issued on Aug. 5, 2008, titled EXTENSION OF WRITE ANYWHERE FILE SYSTEM LAYOUT, by John K. Edwards et al. and assigned to Network Appliance, Inc. The underlying physical volume is an aggregate comprising one or more groups of disks, such as RAID groups, of the storage system. The aggregate has its own physical volume block number (pvbn) space and maintains metadata, such as block allocation structures, within that pvbn space. Each vvol has its own virtual volume block number (vvbn) space and maintains metadata, such as block allocation structures, within that vvbn space. Each vvol is a file system that is associated with a container file; the container file is a file in the aggregate that contains all blocks used by the vvol. Moreover, each vvol comprises data blocks and indirect blocks that contain block pointers that point at either other indirect blocks or data blocks.

In one embodiment, pvbns are used as block pointers within buffer trees of files (such as file 400) stored in a vvol. This “hybrid” vvol embodiment involves the insertion of only the pvbn in the parent indirect block (e.g., inode or indirect block). On a read path of a logical volume, a “logical” volume (vol) info block has one or more pointers that reference one or more fsinfo blocks, each of which, in turn, points to an inode file and its corresponding inode buffer tree. The read path on a vvol is generally the same, following pvbns (instead of vvbns) to find appropriate locations of blocks; in this context, the read path (and corresponding read performance) of a vvol is substantially similar to that of a physical volume. Translation from pvbn-to-disk,dbn occurs at the file system/RAID system boundary of the storage operating system 200.

In an illustrative dual vbn hybrid (“flexible”) vvol embodiment, both a pvbn and its corresponding vvbn are inserted in the parent indirect blocks in the buffer tree of a file. That is, the pvbn and vvbn are stored as a pair for each block pointer in most buffer tree structures that have pointers to other blocks, e.g., level 1 (L1) indirect blocks, inode file level 0 (L0) blocks. FIG. 5 is a schematic block diagram of an illustrative embodiment of a buffer tree of a file 500 that may be advantageously used with the present invention. A root (top-level) inode 502, such as an embedded inode, references indirect (e.g., level 1) blocks 504. Note that there may be additional levels of indirect blocks (e.g., level 2, level 3) depending upon the size of the file. The indirect blocks (and inode) contain pvbn/vvbn pointer pair structures 508 that ultimately reference data blocks 506 used to store the actual data of the file.

The pvbns reference locations on disks of the aggregate, whereas the vvbns reference locations within files of the vvol. The use of pvbns as block pointers 508 in the indirect blocks 504 provides efficiencies in the read paths, while the use of vvbn block pointers provides efficient access to required metadata. That is, when freeing a block of a file, the parent indirect block in the file contains readily available vvbn block pointers, which avoids the latency associated with accessing an owner map to perform pvbn-to-vvbn translations; yet, on the read path, the pvbn is available.

As noted, each inode has 64 bytes in its data section that, depending upon the size of the inode file (e.g., greater than 64 bytes of data), function as block pointers to other blocks. For traditional and hybrid volumes, those 64 bytes are embodied as 16 block pointers, i.e., sixteen (16) 4-byte block pointers. For the illustrative dual vbn flexible volume, the 64 bytes of an inode are embodied as eight (8) pairs of 4-byte block pointers, wherein each pair is a vvbn/pvbn pair. In addition, each indirect block of a traditional or hybrid volume may contain up to 1024 (pvbn) pointers; each indirect block of a dual vbn flexible volume, however, has a maximum of 510 (pvbn/vvbn) pairs of pointers.
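
The pointer-count arithmetic above corresponds to the following layouts; the structure names are illustrative, and the static assertions simply confirm that 1024 single pointers, or 510 pointer pairs plus reserved bytes, fill a 4 KB block, and that either inode layout occupies the 64-byte data section:

    #include <stdint.h>

    struct vbn_pair {                     /* dual vbn block pointer */
        uint32_t pvbn;                    /* physical volume block number */
        uint32_t vvbn;                    /* virtual volume block number */
    };

    struct hybrid_indirect {              /* traditional/hybrid volume */
        uint32_t pvbn[1024];              /* 1024 x 4 bytes = 4096 bytes */
    };

    struct flexible_indirect {            /* dual vbn flexible volume */
        struct vbn_pair pairs[510];       /* 510 x 8 bytes = 4080 bytes */
        uint8_t reserved[16];             /* balance of the 4 KB block */
    };

    union inode_data_section {            /* the inode's 64-byte data section */
        uint32_t ptrs[16];                /* traditional/hybrid: 16 pointers */
        struct vbn_pair pairs[8];         /* dual vbn flexible: 8 pairs */
    };

    _Static_assert(sizeof(struct hybrid_indirect) == 4096, "4 KB indirect block");
    _Static_assert(sizeof(struct flexible_indirect) == 4096, "4 KB indirect block");
    _Static_assert(sizeof(union inode_data_section) == 64, "64-byte data section");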

Moreover, one or more of pointers 508 may contain a special ABSENT value to signify that the object(s) (e.g., an indirect block or data block) referenced by the pointer(s) is not locally stored (e.g., on the volume) and, thus, must be fetched (retrieved) from an alternate backing store. In the illustrative embodiment, the Load_Block( ) function interprets the content of each pointer and, if a requested block is ABSENT, initiates transmission of an appropriate request (e.g., a remote fetch operation) for the data to a backing store using, e.g., the NRV protocol.

FIG. 6 is a schematic block diagram of an embodiment of an aggregate 600 that may be advantageously used with the present invention. Luns (blocks) 602, directories 604, qtrees 606 and files 608 may be contained within vvols 610, such as dual vbn flexible vvols, that, in turn, are contained within the aggregate 600. The aggregate 600 is illustratively layered on top of the RAID system, which is represented by at least one RAID plex 650 (depending upon whether the storage configuration is mirrored), wherein each plex 650 comprises at least one RAID group 660. Each RAID group further comprises a plurality of disks 630, e.g., one or more data (D) disks and at least one (P) parity disk.

Whereas the aggregate 600 is analogous to a physical volume of a conventional storage system, a vvol is analogous to a file within that physical volume. That is, the aggregate 600 may include one or more files, wherein each file contains a vvol 610 and wherein the sum of the storage space consumed by the vvols is physically smaller than (or equal to) the size of the overall physical volume. The aggregate utilizes a physical pvbn space that defines a storage space of blocks provided by the disks of the physical volume, while each embedded vvol (within a file) utilizes a logical vvbn space to organize those blocks, e.g., as files. Each vvbn space is an independent set of numbers that corresponds to locations within the file, which locations are then translated to dbns on disks. Since the vvol 610 is also a logical volume, it has its own block allocation structures (e.g., active, space and summary maps) in its vvbn space.
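
Since a vvbn is effectively an offset into the vvol's container file, the two-step translation implied above may be sketched as follows; the helper declarations are hypothetical stand-ins for the aggregate's container-file metadata and the RAID mapping:

    #include <stdint.h>

    struct disk_loc { unsigned disk_id; uint64_t dbn; };

    /* Hypothetical stand-ins for aggregate metadata lookups. */
    extern uint64_t        container_map_lookup(uint32_t vvol_id, uint64_t vvbn); /* vvbn -> pvbn */
    extern struct disk_loc raid_map(uint64_t pvbn);                               /* pvbn -> (disk, dbn) */

    /* Translate a block in the vvol's vvbn space to its physical home. */
    struct disk_loc vvbn_to_disk(uint32_t vvol_id, uint64_t vvbn)
    {
        uint64_t pvbn = container_map_lookup(vvol_id, vvbn);  /* block within the aggregate */
        return raid_map(pvbn);                                /* location on a disk */
    }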

A container file is a file in the aggregate that contains all blocks used by a vvol. The container file is an internal (to the aggregate) feature that supports a vvol; illustratively, there is one container file per vvol. Similar to a pure logical volume in a file approach, the container file is a hidden file (not accessible to a user) in the aggregate that holds every block in use by the vvol. The aggregate includes an illustrative hidden metadata root directory that contains subdirectories of vvols:

    WAFL/fsid/filesystem file, storage label file

Specifically, a physical file system (WAFL) directory includes a subdirectory for each vvol in the aggregate, with the name of the subdirectory being a file system identifier (fsid) of the vvol. Each fsid subdirectory (vvol) contains at least two files, a filesystem file and a storage label file. The storage label file is illustratively a 4 KB file that contains metadata similar to that stored in a conventional raid label. In other words, the storage label file is the analog of a raid label and, as such, contains information about the state of the vvol such as, e.g., the name of the vvol, a universal unique identifier (uuid) and fsid of the vvol, whether it is online, being created or being destroyed, etc.

FIG. 7 is a schematic block diagram of an on-disk representation of an aggregate 700. The storage operating system 200, e.g., the RAID system 240, assembles a physical volume of pvbns to create the aggregate 700, with pvbns 1 and 2 comprising a “physical” volinfo block 702 for the aggregate. The volinfo block 702 contains block pointers to fsinfo blocks 704, each of which may represent a snapshot of the aggregate. Each fsinfo block 704 includes a block pointer to an inode file 706 that contains inodes of a plurality of files, including an owner map 710, an active map 712, a summary map 714 and a space map 716, as well as other special metadata files. The inode file 706 further includes a root directory 720 and a “hidden” metadata root directory 730, the latter of which includes a namespace having files related to a vvol in which users cannot “see” the files. The hidden metadata root directory also includes the WAFL/fsid/ directory structure that contains filesystem file 740 and storage label file 790. Note that root directory 720 in the aggregate is empty; all files related to the aggregate are organized within the hidden metadata root directory 730.

In addition to being embodied as a container file having level 1 blocks organized as a container map, the filesystem file 740 includes block pointers that reference various file systems embodied as vvols 750. The aggregate 700 maintains these vvols 750 at special reserved inode numbers. Each vvol 750 also has special reserved inode numbers within its vvol space that are used for, among other things, the block allocation bitmap structures. As noted, the block allocation bitmap structures, e.g., active map 762, summary map 764 and space map 766, are located in each vvol.

Specifically, each vvol 750 has the same inode file structure/content as the aggregate, with the exception that there is no owner map and no WAFL/fsid/filesystem file, storage label file directory structure in a hidden metadata root directory 780. To that end, each vvol 750 has a volinfo block 752 that points to one or more fsinfo blocks 800, each of which may represent a snapshot, along with the active file system of the vvol. Each fsinfo block, in turn, points to an inode file 760 that, as noted, has the same inode structure/content as the aggregate with the exceptions noted above. Each vvol 750 has its own inode file 760 and distinct inode space with corresponding inode numbers, as well as its own root (fsid) directory 770 and subdirectories of files that can be exported separately from other vvols.

The storage label file 790 contained within the hidden metadata root directory 730 of the aggregate is a small file that functions as an analog to a conventional raid label. A raid label includes physical information about the storage system, such as the volume name; that information is loaded into the storage label file 790. Illustratively, the storage label file 790 includes the name 792 of the associated vvol 750, the online/offline status 794 of the vvol, and other identity and state information 796 of the associated vvol (whether it is in the process of being created or destroyed).
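
By way of a hedged example, the kind of identity and state information carried by the storage label file 790 might be represented as follows; the field names, sizes, and state encoding are illustrative assumptions rather than the actual label format.

    /* Illustrative sketch of storage label file contents; not the actual format. */
    #include <stdint.h>

    enum vvol_state {                  /* state information 796, simplified */
        VVOL_CREATING,
        VVOL_ONLINE,
        VVOL_OFFLINE,
        VVOL_DESTROYING
    };

    struct storage_label {
        char            vvol_name[256];   /* name 792 of the associated vvol */
        uint8_t         uuid[16];         /* universal unique identifier */
        uint32_t        fsid;             /* file system identifier of the vvol */
        enum vvol_state state;            /* online/offline status 794 and
                                           * create/destroy progress */
    };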

D. Sparse Volumes

The present invention overcomes the disadvantages of the prior art by providing a system and method for instantiating a sparse volume within a file system of a storage system that is used to restore data from a secondary storage system (backing store) on demand. As described herein, a sparse volume contains one or more files with at least one data block (i.e., an absent block) that is not stored locally on disk (i.e., on a local volume) coupled to the storage system. By not storing the data block (or a block of zeros as in a hole environment), the sparse volume may be generated and exported quickly with minimal write operations required. The missing data of an absent block is stored on the alternate, possibly remote, backing store and is illustratively retrieved using a remote fetch operation.

The sparse volume is identified by a special marking of an on-disk structure of the volume (vvol) to denote the inclusion of a file with an absent block. FIG. 8 is a schematic block diagram of the on-disk structure, which illustratively is an exemplary fsinfo block 800. The fsinfo block 800 includes a set of persistent consistency point image (PCPI) pointers 805, a sparse volume flag field 810, an inode for the inode file 815 and, in alternate embodiments, additional fields 820. The PCPI pointers 805 are dual vbn (vvbn/pvbn) pairs of pointers to PCPIs (snapshots) associated with the file system. The sparse volume flag field 810 identifies whether the vvol described by the fsinfo block is sparse. In the illustrative embodiment, a flag is asserted in field 810 to identify the volume as sparse. The sparse volume flag field 810 may be embodied as a type field identifying the type of a vvol associated with the fsinfo block. The inode for the inode file 815 includes the inode containing the root-level pointers to the inode file 760 (FIG. 7) of the file system associated with the fsinfo block.
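
A minimal sketch of how the sparse volume flag field 810 might be tested is shown below; the structure layout, flag value, and function name are assumptions, since the text only requires that a flag (or type field) in the fsinfo block be asserted for a sparse vvol.

    /* Illustrative test of the sparse volume flag; layout and flag value assumed. */
    #include <stdbool.h>
    #include <stdint.h>

    #define FSINFO_FLAG_SPARSE 0x1u        /* hypothetical bit asserted for sparse vvols */

    struct fsinfo {
        uint64_t pcpi_ptrs[2];             /* PCPI pointers 805, simplified */
        uint32_t sparse_flags;             /* sparse volume flag field 810 */
        uint64_t inode_file_root;          /* inode for the inode file 815, simplified */
    };

    static bool vvol_is_sparse(const struct fsinfo *fsi)
    {
        return (fsi->sparse_flags & FSINFO_FLAG_SPARSE) != 0;
    }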

Appropriate block pointer(s) of the file are marked (labeled) with special ABSENT value(s) to identify that certain block(s), including data and/or indirect blocks, within the sparse volume are not physically located on the storage system serving the volume. The special ABSENT value further alerts the file system that the data is to be obtained from the alternate source, namely a remote backing store, which is illustratively near line storage server 120b. In response to a data access request, the Load_Block( ) function 284 of the file system 280 detects whether an appropriate block pointer of a file is marked as ABSENT and, if so, transmits a remote fetch (e.g., read) operation from the storage system to the remote backing store to fetch the required data. The fetch operation illustratively requests one or more file block numbers (fbns) of the file stored on the backing store. It should be noted that while the present description is written in terms of a single backing store, the principles of the present invention may be applied to an environment where a single sparse volume is supported by a plurality of backing stores, each of which may support the entire sparse volume or a subset thereof. As such, the teachings should not be taken to be limited to single backing stores.
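
The ABSENT test performed by Load_Block( ) can be pictured with the following minimal sketch. The sentinel encoding (an all-ones block number) and the dual vbn pointer layout are assumptions for exposition only; the disclosure requires only that the pointer carry a special value distinguishable from any valid block number.

    /* Minimal sketch of detecting an ABSENT block pointer; encoding assumed. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ABSENT_VBN ((uint64_t)-1)      /* hypothetical ABSENT sentinel value */

    struct block_ptr {                     /* dual vbn pointer in an indirect block */
        uint64_t pvbn;
        uint64_t vvbn;
    };

    static bool block_is_absent(const struct block_ptr *p)
    {
        /* The block's data is not on the local volume and must be fetched
         * from the backing store (e.g., via a remote NRV read). */
        return p->pvbn == ABSENT_VBN;
    }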

The backing store retrieves the requested data from its storage devices and returns the requested data to the storage system, which processes the data access request and stores the returned data in its memory. Subsequently, the file system “flushes” (writes) the data stored in memory to local disk during a write allocation procedure. This could be in response to the data being marked as “dirty,” or another notation denoting to the file system that the data must be write allocated. In accordance with an illustrative write anywhere policy of the procedure, the file system assigns pointer values (other than ABSENT values) to indirect block(s) of the file to thereby identify location(s) of the data stored locally within the local volume. Thus, the remote fetch operation is no longer needed to access the data.

An example of a write allocation procedure that may be advantageously used with the present invention is described in U.S. Pat. No. 7,430,571, issued on Sep. 30, 2008, titled EXTENSION OF WRITE ANYWHERE FILE LAYOUT WRITE ALLOCATION, by John K. Edwards, which is hereby incorporated by reference. Broadly stated, block allocation proceeds in parallel on the flexible vvol and aggregate when write allocating a block within the vvol, with a write allocator process 282 selecting an actual pvbn in the aggregate and a vvbn in the vvol. The write allocator adjusts block allocation bitmap structures, such as an active map and space map, of the aggregate to record the selected pvbn and adjusts similar structures of the vvol to record the selected vvbn. A vvid (vvol identifier) of the vvol and the vvbn are inserted into owner map 710 of the aggregate at an entry defined by the selected pvbn. The selected pvbn is also inserted into a container map (not shown) of the destination vvol. Finally, an indirect block or inode file parent of the allocated block is updated with one or more block pointers to the allocated block. The content of the update operation depends on the vvol embodiment. For a dual vbn hybrid vvol embodiment, both the pvbn and vvbn are inserted in the indirect block or inode as block pointers.
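
The parallel bookkeeping described above can be summarized with a hedged sketch: select a pvbn and a vvbn, record them in the respective allocation bitmaps, record the owner map and container map entries, and write the dual vbn pair into the parent indirect block. The toy allocator, structure layouts, and sizes below are assumptions and not the write allocation procedure of the incorporated patent.

    /* Hedged sketch of dual-vbn write allocation bookkeeping; all structures
     * and the trivial allocator are illustrative assumptions. */
    #include <stdint.h>

    #define NBLOCKS 1024

    struct bitmap { uint8_t bits[NBLOCKS / 8]; };

    static void bitmap_set(struct bitmap *bm, uint32_t n)
    {
        bm->bits[n / 8] |= (uint8_t)(1u << (n % 8));
    }

    struct owner_entry { uint32_t vvid; uint32_t vvbn; };  /* owner map 710, indexed by pvbn */

    struct aggregate {
        struct bitmap      active_map, space_map;
        struct owner_entry owner_map[NBLOCKS];
        uint32_t           next_free_pvbn;                 /* toy allocator state */
    };

    struct vvol {
        uint32_t      vvid;
        struct bitmap active_map, space_map;
        uint32_t      container_map[NBLOCKS];              /* vvbn -> pvbn */
        uint32_t      next_free_vvbn;
    };

    struct block_ptr { uint32_t pvbn, vvbn; };             /* dual vbn block pointer */

    static void write_allocate(struct aggregate *aggr, struct vvol *vv,
                               struct block_ptr *parent_ptr)
    {
        uint32_t pvbn = aggr->next_free_pvbn++;            /* select pvbn in the aggregate */
        uint32_t vvbn = vv->next_free_vvbn++;              /* select vvbn in the vvol */

        bitmap_set(&aggr->active_map, pvbn);               /* record pvbn in aggregate maps */
        bitmap_set(&aggr->space_map, pvbn);
        bitmap_set(&vv->active_map, vvbn);                 /* record vvbn in vvol maps */
        bitmap_set(&vv->space_map, vvbn);

        aggr->owner_map[pvbn].vvid = vv->vvid;             /* owner map entry at pvbn */
        aggr->owner_map[pvbn].vvbn = vvbn;
        vv->container_map[vvbn] = pvbn;                    /* container map entry */

        parent_ptr->pvbn = pvbn;                           /* dual vbn into parent indirect */
        parent_ptr->vvbn = vvbn;
    }

    int main(void)
    {
        static struct aggregate aggr;
        static struct vvol vv = { .vvid = 7 };
        struct block_ptr ptr;

        write_allocate(&aggr, &vv, &ptr);                  /* one block write allocated */
        return 0;
    }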

FIG. 9 is a flow chart detailing the steps of a procedure 900 for servicing a data access request (e.g., a read request) directed to a sparse volume. The procedure begins in step 905 and continues to step 910, where the storage system receives a data access request from a client. The data access request is processed by the file system in step 915 by, for example, converting the request to a set of file system primitive operations. Then, in step 917, the appropriate file geometry and inode data are loaded. This may be accomplished using the Load_Inode( ) 292 function, which is further described in the above-incorporated U.S. patent application Ser. No. 11/409,887, entitled SYSTEM AND METHOD FOR SPARSE VOLUMES, by Jason Lango, et al., and U.S. patent application Ser. No. 11/409,624, entitled ARCHITECTURE FOR SUPPORT OF SPARSE VOLUMES, by Jason Lango et al. Generally, the file geometry and inode data permit the storage system to identify the appropriate amount of space to reserve when restoring a file (or other data container) that has ABSENT blocks.

In step 920, the file system identifies one or more blocks to be loaded and, in step 925, invokes the Load_Block( ) function to load one or more of the identified blocks. In step 930, a determination is made as to whether the block(s) is marked ABSENT. This determination may be made, for example, by examining a block pointer referencing the block. If the block is not absent, the procedure branches to step 935, where the block is retrieved from disk and, in step 940, the data access request is performed. In the case of a read request, performance of the request includes returning the retrieved data to the client. The procedure then completes in step 965.

However, if the block is absent (step 930), the procedure continues to step 945, where a remote data access (fetch) request is sent to a backing store to fetch the requested block(s). The fetch request may be issued by a fetch module of the storage operating system, such as the exemplary NRV protocol mentioned herein. As noted above, a plurality of backing stores may be utilized with a sparse volume. In the example of an environment with a plurality of backing stores, metadata contained in a sparse configuration metadata file 732 identifies the appropriate backing store to utilize. The backing store receives the remote data access request and responds with the requested data in step 950. In step 955, the data access request is performed with the data retrieved from the backing store. Subsequently, write allocation is performed to store the retrieved data on one or more local storage devices in step 960. The procedure then completes in step 965.
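
A hedged sketch of the end-to-end flow of procedure 900 follows. The helpers stand in for the steps named in the flow chart (inode and geometry loading is folded into the pointer lookup); the sentinel value, stub behavior, and the dirty flag field are assumptions made so the sketch is self-contained.

    /* Hedged sketch of servicing a read directed to a sparse volume (FIG. 9).
     * All helpers are hypothetical stand-ins for the numbered steps. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ABSENT_VBN ((uint64_t)-1)           /* hypothetical ABSENT sentinel */

    struct buf { uint64_t vbn; bool dirty; unsigned char data[4096]; };

    static uint64_t lookup_block_ptr(uint32_t file_id, uint64_t fbn)   /* steps 917-930 */
    {
        (void)file_id;
        return fbn == 1 ? ABSENT_VBN : fbn + 100;          /* toy mapping */
    }

    static void read_from_disk(struct buf *b) { (void)b; }             /* step 935 */

    static void fetch_from_backing_store(uint32_t file_id, uint64_t fbn,
                                         struct buf *b)                /* steps 945-950 */
    {
        (void)file_id; (void)fbn; (void)b;
    }

    static void service_read(uint32_t file_id, uint64_t fbn, struct buf *b)
    {
        uint64_t ptr = lookup_block_ptr(file_id, fbn);

        if (ptr != ABSENT_VBN) {
            b->vbn = ptr;
            read_from_disk(b);                             /* steps 935-940: local read */
            return;
        }
        fetch_from_backing_store(file_id, fbn, b);         /* steps 945-955: remote fetch */
        b->dirty = true;                                   /* step 960: dirty the buffer so
                                                            * write allocation later stores
                                                            * the block on the local volume */
    }

    int main(void)
    {
        struct buf b = {0};
        service_read(42, 0, &b);        /* block present locally */
        service_read(42, 1, &b);        /* absent block: fetched and dirtied */
        return 0;
    }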

E. Restore on Demand (ROD)

In the illustrative embodiment, a sparse volume is initialized with volume infrastructure metadata that utilizes pointers (e.g., ABSENT pointers) to data stored on the backing store. Use of these ABSENT pointers presents a user, such as a client, with the illusion of an “instant” full restore, thereby avoiding the long wait associated with a conventional full restore operation. The data may then be “restored on demand,” which as used herein denotes waiting until a specific request for the data is issued before expending storage system resources to acquire the data. Such restoration of data may be accomplished in response to a client issuing a data access request to the storage system, or by a restore module of the system generating a request (“demand”) for the data during, e.g., background processing. In accordance with the present invention, a sparse volume may be instantiated to quickly restore a local volume that has failed. It should be noted that once a restoration of a sparse volume has begun, the sparse volume is available for all file system operations including, e.g., new modifications (write operations). Any operations may be performed on the sparse volume once restore on demand has been initiated. For example, a backup operation may be initiated to a sparse volume.

FIG. 10 is a flow chart detailing steps of a procedure 1000 for quickly restoring a failed local volume using a sparse volume. The procedure begins in step 1005 and continues to step 1010, where the local volume of the storage system is determined to have failed. The failed volume may be any of a plurality of volumes of the storage system. At step 1015, the sparse volume is instantiated (created) by an administrator or an automated process by, e.g., entering certain information (e.g., the volume name) associated with the sparse volume into the system via the UI 275. In step 1020, the storage system fetches the volume infrastructure metadata from the backing store that is needed to initialize the sparse volume. Typically, the backing store will contain an up-to-date copy of this metadata for the failed volume, but it may also be desirable to restore the metadata from a PCPI or snapshot. The volume infrastructure metadata fetched includes the current file system version, the total size of the volume (number of inodes and/or number of blocks), the content of the root file system directory (root_dir) and other file system specific metadata stored in, e.g., volinfo and fsinfo data structures. Notably, the file system data of the sparse volume is absent, as manifested by certain blocks of the inode file being populated (initialized) with ABSENT pointers in step 1025.

Once the infrastructure of the sparse volume is created, at step 1030, the volume is available for any client access. It should be noted that after failure of the local volume, clients may be required to unmount and remount the restored (sparse) volume to ensure that they operate on valid data, rather than previously cached versions of “stale” data. For client-issued requests, restoration of data, including any file system data and remaining metadata, may be accomplished as described above with reference to FIG. 9. In order to restore (retrieve) such data, only logical file information, such as file identification (file ID) numbers, file handles, and offset values, needs to be transferred between the storage system (primary) and the backing store (secondary). The backing store then returns the requested data to the storage system, which performs write allocation on that data. As a result, “fresh” block allocation information is created for the sparse volume, including new pvbns and vvbns in accordance with the write allocation procedure described above. Thus, it is not necessary to transfer any write allocation files (inode map, summary map, active map, etc.) between the systems. The procedure then ends at step 1035.
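
One way to picture the initialization performed by procedure 1000 is the hedged sketch below: the volume infrastructure metadata is fetched (step 1020), the sparse flag is asserted, the inode file's block pointers are populated with ABSENT values (step 1025), and the volume is brought online (step 1030). The data layout, sizes, and stubbed fetch are illustrative assumptions.

    /* Hedged sketch of sparse volume instantiation (FIG. 10); all types,
     * sizes, and the stubbed metadata fetch are assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define ABSENT_VBN   ((uint64_t)-1)
    #define INODE_BLOCKS 64                        /* toy inode file size */

    struct volume_infra {                          /* metadata fetched in step 1020 */
        uint32_t fs_version;                       /* current file system version */
        uint64_t total_inodes;
        uint64_t total_blocks;
        char     root_dir_listing[512];            /* content of root_dir, simplified */
    };

    struct sparse_volume {
        struct volume_infra infra;
        int      sparse;                           /* sparse volume flag (field 810) */
        uint64_t inode_file_ptrs[INODE_BLOCKS];    /* block pointers of the inode file */
        int      online;
    };

    /* Placeholder for the remote fetch of infrastructure metadata from the
     * backing store (or from a PCPI/snapshot). */
    static void fetch_volume_infra(struct volume_infra *vi)
    {
        memset(vi, 0, sizeof(*vi));
        vi->fs_version = 1;
    }

    static void instantiate_sparse_volume(struct sparse_volume *sv)
    {
        fetch_volume_infra(&sv->infra);            /* step 1020 */
        sv->sparse = 1;

        /* Step 1025: the file system data itself is absent, so the inode
         * file's block pointers are initialized to the ABSENT value. */
        for (int i = 0; i < INODE_BLOCKS; i++)
            sv->inode_file_ptrs[i] = ABSENT_VBN;

        sv->online = 1;                            /* step 1030: available to clients */
    }

    int main(void)
    {
        struct sparse_volume sv;
        instantiate_sparse_volume(&sv);
        return 0;
    }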

The following example describes how a client can access its data on demand once a sparse volume is instantiated to quickly restore a failed local volume. Assume the client wishes to access a file “document.doc” from its directory served by the storage system 120a. The file system accesses the root directory to locate the file in a conventional manner. If the file system encounters any absent blocks within the buffer tree of the file, the blocks are restored from the backing store as described herein. For instance, assume that document.doc is located in a “../users/client/” directory, neither of which is present on the sparse volume. The file system 280 cooperates with the NRV module 295 to issue fetch requests for data from the backing store needed to populate the “users/” directory in order to find the “client/” directory, and then subsequently populate the “client/” directory to locate the document.doc file. Note that while populating the “client/” directory, other directories found in the “users/” directory are not populated and remain absent (thus space is reserved for the other directories) until needed at a later time. With the file ID and file handle of document.doc, the primary may then restore the file from the secondary in accordance with the present invention. It is possible that the primary storage system may only fetch the particular block or blocks of the file that are requested, and not the entire file. An example of when this may occur is when servicing a client request for a thumbnail of a file in a Microsoft® WINDOWS™ environment.

Demand Generator

In addition to restoring absent data on the sparse volume in response to client requests, it may be desirable to ensure that the entire content of the volume is restored as quickly as possible, yet with minimal file service disruption. Entire volume restoration is desirable because each client access to the remote backing store generates a retrieval delay. Once all of the volume data is restored locally, this delay no longer exists. Also, in the event that the backing store becomes unavailable, data not yet restored on the primary storage system may be lost. This window of vulnerability can be reduced by implementing a restore module of the storage operating system 200 to run as a background process. It should be noted that if the backing store becomes unavailable, the primary storage system may continue to service data access operations until the backing store becomes available and the restore on demand process is restarted. The primary storage system will be able to process write operations and serve any read operations directed to data that has already been restored.

According to an aspect of the invention, the restore module is embodied as the novel demand generator 296 configured to scan the sparse volume, searching for blocks with ABSENT pointers. Upon locating such a block, the demand generator initiates a remote fetch operation to retrieve the missing data referenced by each ABSENT pointer from the backing store. The retrieved data is then write allocated to populate the sparse volume. Population of the sparse volume with missing data preferably occurs in connection with a multi-phase projected sequence until there are no absent blocks remaining in the file system. At this time, the sparse volume transitions to a fully restored, detached local volume.

FIG. 11 is a flow chart illustrating a procedure 1100 for operating the demand generator in accordance with the present invention. The procedure starts at step 1105 and continues to step 1110 where a determination is made as to whether the volume is a sparse volume. In the illustrative embodiment, this determination is preferably rendered by the file system 280 by examining, as described above, a special indicator or flag located in the sparse volume flag field 810 of the fsinfo block 800 for the volume. If the volume is not sparse, the procedure ends at step 1155. If, on the other hand, the volume is a sparse volume, the procedure continues to step 1112, where the file system cooperates with the demand generator to “walk through” the sparse volume searching for absent blocks. Here, the demand generator illustratively invokes a scanner process 286 of the file system to walk through the volume. Specifically, the scanner starts at a top-level inode, such as the inode of the inode file, and traverses a projected sequence to a last file of the file system. At step 1115, the scanner initializes to a first file of the projected sequence by setting the desired file identifier (ID) to the first file of the file system. It should be noted that in the illustrative WAFL file system, the first file ID may belong to a specific file system file that should already have been recovered (root_dir, active map, etc.), so the file ID for the first actual file may be a value greater than zero (or one).

In step 1120, the scanner scans the blocks of a buffer tree of the file, and in step 1125 determines whether any blocks contain an ABSENT pointer, thus indicating that blocks of the file are absent. If the blocks do not contain an ABSENT pointer, then in step 1128 a determination is made as to whether this is the last file in the sparse volume. If so, the procedure ends in step 1155. If not, the scanner proceeds to the next file, e.g., by incrementing the file ID number in step 1130. The procedure then returns to step 1120.

If an absent block is encountered in step 1125, however, the scanner signals the demand generator to proactively request the data for the absent block from the backing store. In an alternate embodiment, the scanner issues a conventional read request directed to the data. This read request will trigger the fetch operation without invoking the demand generator. In step 1135, the demand generator issues a remote data access (fetch) request to the backing store to fetch a requested data block. The backing store receives the remote data access request and responds with the requested data in step 1140.

Subsequently, write allocation is performed on the retrieved data to store the data on one or more storage devices of the sparse volume in step 1145. During the normal course of write allocation, the remaining portions of the buffer tree for the file are created. Next, at step 1128, it is determined if the file is the last file in the sparse volume, and if so, the procedure ends in step 1155. If not, the scanner proceeds to the next file by incrementing the file ID in step 1130, and returns to step 1120 to scan the blocks of the file. This process continues until all absent blocks have been restored, or until the process is manually stopped. Before reaching the last file in the sparse volume, the file system may also be notified that the last absent block has been restored, such as through the sparse volume indication field, in which case the file system may then end the process.
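
The scan loop of procedure 1100 can be sketched as follows, with the buffer trees flattened into a per-file array of block pointers. The layout, the stubbed fetch and write-allocation step, and the file ID range are assumptions; the point of the sketch is simply the walk from the first file ID to the last, fetching each ABSENT block as it is found.

    /* Hedged sketch of the demand generator walk (FIG. 11); data layout and
     * the stubbed fetch/write allocation are illustrative assumptions. */
    #include <stdint.h>

    #define ABSENT_VBN      ((uint64_t)-1)
    #define BLOCKS_PER_FILE 16
    #define NUM_FILES       8

    struct toy_volume {
        int      sparse;                                   /* sparse volume flag */
        uint64_t ptrs[NUM_FILES][BLOCKS_PER_FILE];         /* flattened buffer trees */
    };

    /* Placeholder for steps 1135-1145: fetch the absent block from the
     * backing store and write allocate it, yielding a new local pointer. */
    static uint64_t fetch_and_write_allocate(uint32_t file_id, uint32_t fbn)
    {
        return (uint64_t)file_id * BLOCKS_PER_FILE + fbn + 1000;
    }

    static void demand_generator(struct toy_volume *vol, uint32_t first_file_id)
    {
        if (!vol->sparse)                                  /* step 1110 */
            return;

        for (uint32_t fid = first_file_id; fid < NUM_FILES; fid++) {  /* steps 1115, 1128, 1130 */
            for (uint32_t fbn = 0; fbn < BLOCKS_PER_FILE; fbn++) {    /* step 1120 */
                if (vol->ptrs[fid][fbn] == ABSENT_VBN)                /* step 1125 */
                    vol->ptrs[fid][fbn] = fetch_and_write_allocate(fid, fbn);
            }
        }
        vol->sparse = 0;     /* no absent blocks remain; volume is fully restored */
    }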

FIG. 12 is a flow chart illustrating an embodiment of the projected sequence traversed by the scanner when scanning a sparse volume. Here, the scanner illustratively traverses a multi-phase projected sequence to populate the sparse volume with missing data. Procedure 1200 starts at step 1205, and continues to step 1210, where the scanner cooperates with the demand generator to restore the blocks of the inode file. Thereafter, in step 1215, the directories are restored, followed by the files in step 1220. As will be understood by those skilled in the art, the inode file and directories are restored first in order for the file system to reach a consistent state as early as possible. The procedure then ends in step 1225.
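
Expressed as a hedged sketch, the projected sequence is simply an ordered set of phases; the enumeration and the per-phase restore hook below are assumptions standing in for the scanner passes described above.

    /* Illustrative ordering of the multi-phase projected sequence (FIG. 12). */
    enum restore_phase { PHASE_INODE_FILE, PHASE_DIRECTORIES, PHASE_FILES, PHASE_DONE };

    /* Hypothetical hook that drives the scanner/demand generator over just
     * the inodes belonging to one phase (stubbed here). */
    static void restore_phase_blocks(enum restore_phase phase) { (void)phase; }

    static void run_projected_sequence(void)
    {
        /* Steps 1210, 1215, 1220: inode file first, then directories, then files. */
        for (enum restore_phase p = PHASE_INODE_FILE; p < PHASE_DONE; p++)
            restore_phase_blocks(p);
    }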

The demand generator may also be configured to utilize a special load path that bypasses the buffer cache 170 of the storage system 120a so as not to “pollute” that cache with retrieved data not currently needed by the client. For example, while the file system 280 cooperates with the NRV protocol module 295 to restore files that a client 110 may currently be accessing for an application 112, which files may need to be cached for faster continued access, the demand generator 296 may further request the restoration of files that are not currently needed by the client. Hence, these files do not need to be stored in the buffer cache 170 after write allocation to the local volume. It is also important to note that a client request need not be stored on a cache of the secondary backing store because, once the data is restored, the primary storage system no longer needs to access it on the secondary.

One way to implement the special load path is to mark the demand-generated data in the buffer cache as unnecessary so it may be promptly removed from the cache once the data is written to disk. Marking of unnecessary data may be effected through a modified use of a least recently used (LRU) algorithm. When data is to be marked as unnecessary, the cache block (buffer) containing the data is placed at the beginning of an LRU stack, as opposed to the end, so that it is the first buffer to be reused. Alternatively, a new load path transmission link may be created, which physically bypasses any unnecessary caches; however, this alternate solution may require hardware modification. It will be understood by those skilled in the art that other methods of preventing cache pollution may be used within the scope of the present invention.
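
The modified LRU use can be sketched as below: a buffer holding demand-generated data that no client needs is inserted at the reuse end of the list, so it is the first candidate for eviction once written to disk. The list representation and field names are assumptions.

    /* Hedged sketch of the modified LRU insertion for "unnecessary" buffers. */
    #include <stdbool.h>
    #include <stddef.h>

    struct buf {
        struct buf *prev, *next;
        bool unnecessary;              /* demand-generated data not needed by a client */
    };

    struct lru { struct buf *head, *tail; };   /* head = first buffer to be reused */

    static void lru_insert(struct lru *q, struct buf *b)
    {
        b->prev = b->next = NULL;
        if (q->head == NULL) {                 /* empty list */
            q->head = q->tail = b;
            return;
        }
        if (b->unnecessary) {                  /* evict soon: place at the reuse end */
            b->next = q->head;
            q->head->prev = b;
            q->head = b;
        } else {                               /* normal case: place at the far end */
            b->prev = q->tail;
            q->tail->next = b;
            q->tail = b;
        }
    }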

In addition, the demand generator 296 may implement a read-ahead feature to enhance retrieval of data associated with a sequence of remote fetch operations. Read-ahead algorithms that may be advantageously employed by the demand generator are described in U.S. Pat. No. 7,333,993, entitled ADAPTIVE FILE READAHEAD TECHNIQUE FOR MULTIPLE READ STREAMS, by Robert L. Fair, and U.S. Pat. No. 7,631,148, entitled ADAPTIVE FILE READAHEAD BASED ON MULTIPLE FACTORS, by Robert L. Fair, which are both expressly incorporated herein by reference. The demand generator may cooperate with the file system 280 to employ a speculative read-ahead operation that retrieves blocks that are likely to be requested by subsequent fetch operations. For example, in response to a read request to retrieve a sequence of consecutive blocks, the file system may invoke read-ahead operations to retrieve additional blocks that further extend the sequence, even though those blocks have yet to be requested by the demand generator. As an example, this could be useful when reading a sequential series of absent blocks.
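
A hedged sketch of a speculative read-ahead on the fetch path is shown below: when one file block number is demanded, a few of the following block numbers are also requested so that a sequential run of absent blocks is already in flight. The window size and the stubbed fetch are assumptions, not the algorithms of the incorporated patents.

    /* Hedged read-ahead sketch; window size and fetch stub are assumptions. */
    #include <stdint.h>

    #define READAHEAD_WINDOW 8                 /* hypothetical read-ahead depth */

    /* Placeholder for a remote fetch of one file block number. */
    static void fetch_remote_fbn(uint32_t file_id, uint64_t fbn)
    {
        (void)file_id; (void)fbn;
    }

    static void fetch_with_readahead(uint32_t file_id, uint64_t fbn, uint64_t last_fbn)
    {
        fetch_remote_fbn(file_id, fbn);        /* the demanded block */

        /* Speculatively extend a sequential run, without reading past the file. */
        for (uint64_t ra = fbn + 1; ra <= last_fbn && ra < fbn + READAHEAD_WINDOW; ra++)
            fetch_remote_fbn(file_id, ra);
    }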

In still another embodiment, it is possible to utilize multiple demand generators executing in parallel in the storage operating system to expedite data restoration. Here, each demand generator is responsible for only a portion of the blocks in the sparse volume. For example, two demand generators could divide a task into equal portions, wherein the first demand generator is responsible for the first half of sequential addresses and the second generator is responsible for the second half. Those skilled in the art should understand that there are many alternative configurations for multiple demand generators, and that those variant configurations are within the scope and protection of the present invention.
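
Following the two-generator example above, the work could be partitioned by file ID range, one contiguous slice per generator; the sketch below is illustrative only (each call would run concurrently in practice, and the range arithmetic and entry point are assumptions).

    /* Hedged sketch of partitioning the restore across parallel demand generators. */
    #include <stdint.h>

    /* Hypothetical single-generator entry point over an inclusive file ID range
     * (stubbed here; each instance would run in its own thread or process). */
    static void demand_generator_range(uint32_t first_file_id, uint32_t last_file_id)
    {
        (void)first_file_id; (void)last_file_id;
    }

    static void start_parallel_restore(uint32_t first_file_id, uint32_t last_file_id,
                                       unsigned ngenerators)   /* assumes ngenerators >= 1 */
    {
        uint32_t total = last_file_id - first_file_id + 1;
        uint32_t per   = total / ngenerators;

        for (unsigned i = 0; i < ngenerators; i++) {
            uint32_t lo = first_file_id + i * per;
            uint32_t hi = (i == ngenerators - 1) ? last_file_id : lo + per - 1;
            demand_generator_range(lo, hi);    /* e.g., first half / second half for two */
        }
    }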

Pump Module

According to another aspect of the present invention, the pump module 298 provides flow control to regulate the processing of demands generated by the demand generator 296, as well as the requests issued by clients 110. Flow control may be needed because the scanner and demand generator are capable of issuing and generating access requests for the blocks of the file system substantially faster than the time required to fetch and restore those blocks from the backing store, primarily because the fetch and restore operations are impacted by network latency, disk access delays (as the data may not be located in the secondary backing store's cache), or other external delays. Accordingly, these fetch and restore operations may become a “bottleneck” with respect to performance of the system, resulting in a “backlog” of outstanding demands and requests.

In the event the number of outstanding demands and requests for data missing from the sparse volume reaches a predetermined threshold, the pump module 298 regulates the demand generator 296 to slow down or temporarily pause its generation of demands. The pump module may further implement a priority policy that, e.g., grants precedence to client-issued requests over generated demands for missing data in the event available system resources are limited.

FIG. 13 is a flowchart illustrating a procedure 1300 for implementing flow control using the pump module. The procedure starts at step 1305, and continues to step 1310 where data access requests from the client are monitored at the pump module by, e.g., recording the size and number of those requests. At step 1315, the pump module also monitors the data access requests (demands) generated by the demand generator. In the event the number of requests/demands from the demand generator reaches a predetermined threshold (e.g., a maximum number of requests allowed for the demand generator) at step 1320, the pump module regulates the demand generator at step 1325 to slow or decrease the number of generated demands. Regulation, in this context, may be accomplished in a number of ways, including throttling the scanner, e.g., by adjusting the rate of restore traffic (e.g., a 100 kB/sec maximum), or by pausing it temporarily (allowing it to resume operation at a later time).

The pump module may also function as a priority mechanism for the demands generated by the demand generator and data access requests issued by the client. In order to maintain the appearance of normal operation during a restore on demand operation, the demand generator must not consume all of the bandwidth available for restoration, leaving the client with overly delayed file access times. To ensure that this situation is avoided, the pump module grants precedence to the requests from the client over those from the demand generator. Specifically, if the demand generator has not reached the maximum threshold at step 1320 or is regulated at step 1325, the pump module determines at step 1330 if the demand generator is over-utilizing (consuming) resources in a manner that limits available resources for client data access requests. If so, the pump module grants priority to the client data access requests at step 1335 by, e.g., placing the demand generator on hold, and then returns to step 1310 to further monitor the requests. If the demand generator is not consuming an abundance of valuable resources, it is allowed to continue unimpeded, and the procedure returns to step 1310. Other types of requests may have different levels of priority, such as, for example, high priority for special file system commands, or low priority for read-aheads.
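
The regulation and priority policy of procedure 1300 can be reduced to a small decision function, sketched below. The counters, the threshold value, and the crowding test are illustrative assumptions; the disclosure requires only that the generator be slowed or paused at a threshold and that client requests take precedence.

    /* Hedged sketch of the pump module's regulation and priority decisions. */
    #include <stdbool.h>

    #define MAX_DEMANDS_OUTSTANDING 90          /* hypothetical generator threshold */

    struct pump_state {
        unsigned client_outstanding;            /* monitored in step 1310 */
        unsigned demand_outstanding;            /* monitored in step 1315 */
        bool     generator_paused;
    };

    static void pump_regulate(struct pump_state *p)
    {
        if (p->demand_outstanding >= MAX_DEMANDS_OUTSTANDING) {
            p->generator_paused = true;         /* steps 1320-1325: throttle or pause */
        } else if (p->client_outstanding > 0 &&
                   p->demand_outstanding > p->client_outstanding) {
            p->generator_paused = true;         /* steps 1330-1335: clients take precedence */
        } else {
            p->generator_paused = false;        /* generator may continue */
        }
    }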

In one embodiment of the present invention, the pump module may be embodied as a plurality of threads organized to function as a type of queue, through which all data fetch requests to the secondary flow. The pump module may generate the actual fetch requests, using, e.g., the exemplary NRV protocol. Each thread may be assigned to process (i.e., generate and transmit) one request at a time, and wait for a response to that request prior to processing a next request. Yet, through the use of multiple threads, requests can complete out of order. This is similar to what is generally referred to in the art as a leaky bucket algorithm. Moreover, the demand generator can issue demands to the pump module so long as a predetermined number of threads remain available. For example, the pump module may be configured with one hundred threads (i.e., a “one hundred request queue length”), allowing essentially unlimited service of client-issued requests while limiting service of demand-generated requests so that at least ten threads remain available. Thus, if no client requests are issued, the demand generator can send up to ninety demands at any one time.
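
The admission rule implied by the hundred-thread example can be sketched as follows: a client fetch may take any free thread, while a generated demand is admitted only while a reserve of threads remains free for clients, so at most ninety demands are in flight at once. The counters are illustrative; a real implementation would use actual threads and locking rather than a single counter.

    /* Hedged sketch of thread-pool admission for the pump module's fetch queue. */
    #include <stdbool.h>

    #define POOL_THREADS         100
    #define RESERVED_FOR_CLIENTS 10

    struct pump_pool { unsigned busy; };        /* threads currently processing a fetch */

    static bool admit_client_fetch(struct pump_pool *p)
    {
        if (p->busy < POOL_THREADS) { p->busy++; return true; }
        return false;                           /* all threads busy: request must wait */
    }

    static bool admit_demand_fetch(struct pump_pool *p)
    {
        if (p->busy < POOL_THREADS - RESERVED_FOR_CLIENTS) { p->busy++; return true; }
        return false;                           /* keep a reserve free for client requests */
    }

    static void fetch_complete(struct pump_pool *p)
    {
        if (p->busy > 0) p->busy--;             /* a thread becomes available again */
    }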

It should be noted that the teachings of the present invention may be utilized with thinly provisioned volumes. Certain file systems, including the exemplary write anywhere file layout (WAFL) file system available from Network Appliance, Inc., of Sunnyvale, Calif., include the capability to generate a thinly provisioned data container, wherein the data container is not completely written to disk at the time of its creation. As used herein, the term data container generally refers to a unit of storage for holding data, such as a file system, disk file, volume or a logical unit number (LUN), which is addressable by, e.g., its own unique identification. The storage space required to hold the data contents of the thinly provisioned data container on disk has not yet been used. Thinly provisioned data containers are further described in U.S. Pat. No. 7,603,532, entitled SYSTEM AND METHOD FOR RECLAIMING UNUSED SPACE FROM A THINLY PROVISIONED DATA CONTAINER, by Vijayan Rajan, et al.

In a restore on demand environment, thinly provisioned data containers may be utilized for a primary storage system in the event of a disaster recovery scenario. By utilizing a thinly provisioned primary data container, which has physical storage for only a portion of the total amount of the primary volume, an administrator does not need to procure physical storage equal to the total size of the secondary, but can provision only the amount of space needed for the files/data containers that will be used.

To again summarize, the present invention is directed to a system and method for implementing a restore on demand (ROD) operation on a sparse volume of a computer, such as a storage system. The sparse volume is a data container or volume wherein one or more files contained therein require a special retrieval operation to obtain the data. According to the present invention, the sparse volume may be used to quickly restore the use of a local storage device once it has failed. The volume is populated with absent blocks, and the data is then restored on demand. The restoration of data may be accomplished as client data access requests are received, or by a demand generator. As noted above, the demand generator may also be regulated by a pump module. Additionally, as noted above, once the restoration of a sparse volume has been initiated, the volume is illustratively available for all data access operations so that, for example, write operations may be performed, backup operations initiated, etc.

The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the teachings of this invention can be implemented as software, including a computer-readable medium having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

1. A computer method, comprising: executing a storage operating system on a storage system serving a volume; replacing the volume with a sparse volume served by the storage system, the sparse volume comprising a tree structure with at least one pointer referencing data that is not stored locally in the tree structure; storing, within the sparse volume, volume infrastructure metadata of the volume served by the storage system; and receiving a request for the data referenced by the at least one pointer and copying the data referenced by the at least one pointer to the sparse volume in response to receiving the request.
2. The method of claim 1 wherein the request is a client request.
3. The method of claim 1 wherein copying comprises copying the data from a second storage system.
4. The method of claim 1 wherein the request is generated by a demand generator, the demand generator scanning the sparse volume to locate the at least one pointer referencing the data that is not stored locally in the tree structure of the volume.
5. The method of claim 4 further comprising providing flow control for the demand generator.
6. The method of claim 4 further comprising granting precedence to a client issued request over a demand generator generated request.
7. The method of claim 1 wherein copying comprises a remote fetch operation.
8. The method of claim 1 further comprising: instantiating the sparse volume; and processing data access operations directed to the sparse volume at any time after instantiation.
9. The method of claim 1 further comprising marking the at least one pointer with an ABSENT value to indicate that the data is not stored locally in the tree structure.
10. The method of claim 1 wherein the tree structure is a buffer tree.
11. The method of claim 1 further comprising copying the volume infrastructure metadata from a second storage system.
12. The method of claim 1 wherein the volume infrastructure metadata comprises at least one of a current file system version, a total size of the volume, and content of a root file system directory.
13. The method of claim 1 wherein the volume is a virtual volume.
14. A computer data storage system, comprising: a processor configured to execute a storage operating system of a storage system serving a volume; the storage system configured to create a sparse volume to replace the volume, the sparse volume comprising a tree structure with at least one pointer configured to reference data that is not stored locally in the tree structure; the sparse volume configured to store volume infrastructure metadata of the volume served by the storage system; and the storage system further configured to receive a request for the data referenced by the at least one pointer and further configured to copy the data referenced by the at least one pointer to the sparse volume in response to receiving the request.
15. The computer data storage system of claim 14 wherein the request is a client request.
16. The computer data storage system of claim 14 wherein the data is copied from a second storage system.
17. The computer data storage system of claim 14 further comprising a demand generator configured to generate the request.
18. The computer data storage system of claim 17 further comprising a pump module configured to provide flow control for the demand generator.
19. The computer data storage system of claim 18 wherein the pump module is further configured to grant precedence to a client issued request over a demand generator generated request.
20. The computer data storage system of claim 14 wherein the storage system is further configured to copy the data using a remote fetch operation.
21. The computer data storage system of claim 14 wherein the storage system is further configured to instantiate the sparse volume and further configured to process data access operations directed to the sparse volume at any time after instantiation of the sparse volume.
22. The computer data storage system of claim 14 wherein the storage system is further configured to mark the at least one pointer with an ABSENT value to indicate that the data is not stored locally in the tree structure.
23. The computer data storage system of claim 14 wherein the tree structure is a buffer tree.
24. The computer data storage system of claim 14 further comprising a second storage system configured to store the volume infrastructure metadata.
25. The computer data storage system of claim 14 wherein the volume infrastructure metadata comprises at least one of a current file system version, a total size of the volume, and content of a root file system directory.
26. The computer data storage system of claim 14 wherein the volume is a virtual volume.
27. A computer readable storage medium configured to store executable program instructions to be executed by a processor, the computer readable storage medium comprising: program instructions that execute a storage operating system on a storage system serving a volume; program instructions that replace the volume with a sparse volume served by the storage system, the sparse volume comprising a tree structure with at least one pointer referencing data that is not stored locally in the tree structure; program instructions that store, within the sparse volume, volume infrastructure metadata of the volume served by the storage system; and program instructions that receive a request for the data referenced by the at least one pointer and program instructions that copy the data referenced by the at least one pointer to the sparse volume in response to receiving the request.
28. The computer readable storage medium of claim 27 further comprising program instructions that generate, by a demand generator, the request.